Brief

This homework will analyze the dataset which records employee’s personal information. The purpose of this homework is to find all the posible features which might contribute to employee attribution via applying techniques of data visualization and Association Rules Model. Besides, a Shiny App will be created to present the visualization result of Association Rule Model.

Data Importing

Import dataset, convert data type, remove features with no or low variance, remove duplicate, checking and replacing null values, finding and replacing outliers.
Before analyzing the dataset, one should library all the necessary packages.

library(dplyr)
library(arules)
library(knitr)
library(caret)
library(RANN)
library(corrplot)
library(rappdirs)
library(arulesViz)
library(ggplot2)
library(plotly)
library(plyr)
library(tidyr)
library(purrr)
library(rsconnect)
library(shiny)


Importing the Dataset

my.dir <- getwd()
employee_attrition <- read.csv(paste0(my.dir, "/","employee_attrition.csv"), header = TRUE, stringsAsFactors = FALSE)


Checking the data structure.

str(employee_attrition)
## 'data.frame':    1176 obs. of  35 variables:
##  $ Age                     : int  30 52 42 55 35 51 42 23 38 27 ...
##  $ Attrition               : chr  "No" "No" "No" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Non-Travel" ...
##  $ DailyRate               : int  1358 1325 462 177 1029 1318 932 507 1153 1420 ...
##  $ Department              : chr  "Sales" "Research & Development" "Sales" "Research & Development" ...
##  $ DistanceFromHome        : int  16 11 14 8 16 26 1 20 6 2 ...
##  $ Education               : int  1 4 2 1 3 4 2 1 2 1 ...
##  $ EducationField          : chr  "Life Sciences" "Life Sciences" "Medical" "Medical" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1479 813 936 1278 1529 851 827 1533 1782 667 ...
##  $ EnvironmentSatisfaction : int  4 4 3 4 4 1 4 1 4 3 ...
##  $ Gender                  : chr  "Male" "Female" "Female" "Male" ...
##  $ HourlyRate              : int  96 82 68 37 91 66 43 97 40 85 ...
##  $ JobInvolvement          : int  3 3 2 2 2 3 2 3 2 3 ...
##  $ JobLevel                : int  2 2 2 4 3 4 2 2 1 1 ...
##  $ JobRole                 : chr  "Sales Executive" "Laboratory Technician" "Sales Executive" "Healthcare Representative" ...
##  $ JobSatisfaction         : int  3 3 3 2 2 3 4 3 3 1 ...
##  $ MaritalStatus           : chr  "Married" "Married" "Single" "Divorced" ...
##  $ MonthlyIncome           : int  5301 3149 6244 13577 8606 16307 6062 2272 3702 3041 ...
##  $ MonthlyRate             : int  2939 21821 7824 25592 21195 5594 4051 24812 16376 16346 ...
##  $ NumCompaniesWorked      : int  8 8 7 1 1 2 9 0 1 0 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "No" "No" "No" "Yes" ...
##  $ PercentSalaryHike       : int  15 20 17 15 19 14 13 14 11 11 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 3 3 3 NA ...
##  $ RelationshipSatisfaction: int  3 2 1 4 4 3 4 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  2 1 0 1 0 1 1 0 1 1 ...
##  $ TotalWorkingYears       : int  4 9 10 34 11 29 8 5 5 5 ...
##  $ TrainingTimesLastYear   : int  2 3 6 3 3 2 4 2 3 3 ...
##  $ WorkLifeBalance         : int  2 3 3 3 1 2 3 3 3 3 ...
##  $ YearsAtCompany          : int  2 5 5 33 11 20 4 4 5 4 ...
##  $ YearsInCurrentRole      : int  1 2 4 9 8 6 3 3 4 3 ...
##  $ YearsSinceLastPromotion : int  2 1 0 15 3 4 0 1 0 0 ...
##  $ YearsWithCurrManager    : int  2 4 3 0 3 17 2 2 4 2 ...


summary(employee_attrition)
##       Age         Attrition         BusinessTravel       DailyRate     
##  Min.   :18.00   Length:1176        Length:1176        Min.   : 102.0  
##  1st Qu.:30.00   Class :character   Class :character   1st Qu.: 461.8  
##  Median :36.00   Mode  :character   Mode  :character   Median : 796.0  
##  Mean   :36.96                                         Mean   : 800.4  
##  3rd Qu.:43.00                                         3rd Qu.:1162.0  
##  Max.   :60.00                                         Max.   :1499.0  
##                                                                        
##   Department        DistanceFromHome    Education     EducationField    
##  Length:1176        Min.   :  1.000   Min.   :1.000   Length:1176       
##  Class :character   1st Qu.:  2.000   1st Qu.:2.000   Class :character  
##  Mode  :character   Median :  7.000   Median :3.000   Mode  :character  
##                     Mean   :  9.496   Mean   :2.895                     
##                     3rd Qu.: 14.000   3rd Qu.:4.000                     
##                     Max.   :224.000   Max.   :5.000                     
##                     NA's   :2                                           
##  EmployeeCount EmployeeNumber   EnvironmentSatisfaction    Gender         
##  Min.   :1     Min.   :   1.0   Min.   :1.000           Length:1176       
##  1st Qu.:1     1st Qu.: 499.8   1st Qu.:2.000           Class :character  
##  Median :1     Median :1032.5   Median :3.000           Mode  :character  
##  Mean   :1     Mean   :1036.4   Mean   :2.705                             
##  3rd Qu.:1     3rd Qu.:1574.5   3rd Qu.:4.000                             
##  Max.   :1     Max.   :2068.0   Max.   :4.000                             
##                                                                           
##    HourlyRate     JobInvolvement     JobLevel       JobRole         
##  Min.   : 30.00   Min.   :1.000   Min.   :1.000   Length:1176       
##  1st Qu.: 48.00   1st Qu.:2.000   1st Qu.:1.000   Class :character  
##  Median : 66.00   Median :3.000   Median :2.000   Mode  :character  
##  Mean   : 65.82   Mean   :2.741   Mean   :2.069                     
##  3rd Qu.: 83.00   3rd Qu.:3.000   3rd Qu.:3.000                     
##  Max.   :100.00   Max.   :4.000   Max.   :5.000                     
##                                   NA's   :1                         
##  JobSatisfaction MaritalStatus      MonthlyIncome    MonthlyRate   
##  Min.   :1.00    Length:1176        Min.   : 1009   Min.   : 2094  
##  1st Qu.:2.00    Class :character   1st Qu.: 2954   1st Qu.: 8275  
##  Median :3.00    Mode  :character   Median : 4950   Median :14488  
##  Mean   :2.71                       Mean   : 6526   Mean   :14468  
##  3rd Qu.:4.00                       3rd Qu.: 8354   3rd Qu.:20627  
##  Max.   :4.00                       Max.   :19973   Max.   :26999  
##                                                                    
##  NumCompaniesWorked    Over18            OverTime        
##  Min.   :0.000      Length:1176        Length:1176       
##  1st Qu.:1.000      Class :character   Class :character  
##  Median :2.000      Mode  :character   Mode  :character  
##  Mean   :2.709                                           
##  3rd Qu.:4.000                                           
##  Max.   :9.000                                           
##                                                          
##  PercentSalaryHike PerformanceRating RelationshipSatisfaction
##  Min.   :11.0      Min.   :3.000     Min.   :1.000           
##  1st Qu.:12.0      1st Qu.:3.000     1st Qu.:2.000           
##  Median :14.0      Median :3.000     Median :3.000           
##  Mean   :15.3      Mean   :3.163     Mean   :2.718           
##  3rd Qu.:18.0      3rd Qu.:3.000     3rd Qu.:4.000           
##  Max.   :25.0      Max.   :4.000     Max.   :4.000           
##  NA's   :1         NA's   :1         NA's   :1               
##  StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear
##  Min.   :80    Min.   :0.0000   Min.   :  0.0     Min.   :0.00         
##  1st Qu.:80    1st Qu.:0.0000   1st Qu.:  6.0     1st Qu.:2.00         
##  Median :80    Median :1.0000   Median : 10.0     Median :3.00         
##  Mean   :80    Mean   :0.7959   Mean   : 11.4     Mean   :2.81         
##  3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.: 15.0     3rd Qu.:3.00         
##  Max.   :80    Max.   :3.0000   Max.   :114.0     Max.   :6.00         
##                                 NA's   :2                              
##  WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.747   Mean   : 6.918   Mean   : 4.151    
##  3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :4.000   Max.   :40.000   Max.   :18.000    
##                                                     
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   :  0.000     
##  1st Qu.: 0.000          1st Qu.:  2.000     
##  Median : 1.000          Median :  3.000     
##  Mean   : 2.125          Mean   :  4.242     
##  3rd Qu.: 2.000          3rd Qu.:  7.000     
##  Max.   :15.000          Max.   :219.000     
##  NA's   :1

Data Manipulation

Features Management
Removing features with no or low variances

nzv <- nearZeroVar(employee_attrition, saveMetrics = T)
employee_attrition <- employee_attrition[,c(-9,-22,-27)]


Duplicate Management

ifelse(nrow(employee_attrition) == nrow(employee_attrition[!duplicated(employee_attrition),]), 'The dataset does not have duplicate', 'The dataset has duplicate')
## [1] "The dataset does not have duplicate"


Missing values management
Checking for missing value

sum(!complete.cases(employee_attrition))
## [1] 9

There are 9 missing values in the dataset.

row <- which(apply(employee_attrition, 1, function(x) sum(is.na(x))) > 0)
col <- which(apply(employee_attrition, 2, function(x) sum(is.na(x))) > 0)
print(as.data.frame(unique(employee_attrition[row,col])))
##     DistanceFromHome JobLevel PercentSalaryHike PerformanceRating
## 10                 2        1                11                NA
## 33                 6        2                12                 3
## 55                NA        4                13                 3
## 64                 6        2                NA                 3
## 83                 1        1                17                 3
## 89                16       NA                19                 3
## 105               NA        2                14                 3
## 969                2        4                11                 3
## 994               15        2                24                 4
##     RelationshipSatisfaction TotalWorkingYears YearsSinceLastPromotion
## 10                         2                 5                       0
## 33                        NA                17                       1
## 55                         4                21                      15
## 64                         1                20                       1
## 83                         4                NA                       0
## 89                         3                 9                       0
## 105                        2                10                       1
## 969                        4                NA                      11
## 994                        1                15                      NA

Replacing the missing value
Given that the processes of replacement are identical, only one of these processes will be displaced as an exmaple.

employee_attrition$DistanceFromHome[is.na(employee_attrition$DistanceFromHome)] <- round(mean(employee_attrition$DistanceFromHome, na.rm = T))

Except “NA” values, there are also two empty values. Hence, one should find and eliminate these empty value

which(apply(employee_attrition, 1, function(x) sum(x == "")) > 0)
which(apply(employee_attrition, 2, function(x) sum(x == "")) > 0)
employee_attrition <- employee_attrition[c(-1013,-1062),]


Outlier Management
In this process, the mean value of the column will be calculated and rounded. Next, these missing values will be raplced by corresponding mean value.
Before managing outliers, one should convert all the interger data to numeric data and then visualize all the numeric columns to find columns with outliers.

num_var <- sapply(employee_attrition, is.integer) 
employee_attrition[, num_var] <- lapply(employee_attrition[, num_var], as.numeric)

char_var <- sapply(employee_attrition, is.character) 
employee_attrition[, char_var] <- lapply(employee_attrition[, char_var], as.factor)
employee_attrition %>% keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(y = value, fill = "orange")) +
  facet_wrap(~key, scales = "free") +
  geom_boxplot() +
  labs(title = "The Boxplots of Numeric Columns")


For this graph, one can conclude the columns with outliers:
* DistanceFrom Home
* MonthlyIncome
* …
* YearsWithCurrManager

Even though there are several columns with outliers, only three of them will be managed. The reason is that some outliers are plausible to exist in real world. For example, the maximun of “YearsAtCompany” (which is 40) has been recognized as outlier. However, this number is highly possible to exist.
For some extreme values which are ten times than the rest of the data, such as the maximun value of “DistanceFromHome”, these numbers do not make sense at all. Hence these outliers should be replaced.

employee_attrition$DistanceFromHome[employee_attrition$DistanceFromHome 
                                    %in% boxplot.stats(employee_attrition$DistanceFromHome)$out] <- round(median(employee_attrition$DistanceFromHome, na.rm = T))
employee_attrition$TotalWorkingYears[employee_attrition$TotalWorkingYears 
                                     %in% boxplot.stats(employee_attrition$TotalWorkingYears)$out] <- round(median(employee_attrition$TotalWorkingYears, na.rm = T))
employee_attrition$YearsWithCurrManager[employee_attrition$YearsWithCurrManager 
                                        %in% boxplot.stats(employee_attrition$YearsWithCurrManager)$out] <- round(median(employee_attrition$YearsWithCurrManager, na.rm = T))


Data Type Conversion
In this section, in order to analyze the dataset more efficiently. Numeric data will be converted to Factor data via discretization techniques.
Given that all these processes are repetitive, only four conversion processes will be shown here.

employee_attrition$Age_Group <- arules::discretize(employee_attrition$Age, method = "frequency", breaks = 3, labels = c("young", "mid", "elder"))
employee_attrition$DRate_Group <- arules::discretize(employee_attrition$DailyRate, method = "frequency", breaks = 3, labels = c("low", "medium", "high"))
employee_attrition$DistanceFromHome_Group <- arules::discretize(employee_attrition$DistanceFromHome, method = "frequency", breaks = 3, labels = c("short", "medium", "long"))
employee_attrition$Education_Group <- as.factor(employee_attrition$Education)


Final Presentation of Dataset
Checking for Details of the Dataset and save the dataset to local directory.

employee_factors <- employee_attrition[, sapply(employee_attrition, is.factor)]
str(employee_factors)
## 'data.frame':    1174 obs. of  32 variables:
##  $ Attrition                     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ BusinessTravel                : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 1 3 3 3 3 3 3 ...
##  $ Department                    : Factor w/ 3 levels "Human Resources",..: 3 2 3 2 2 3 2 2 2 3 ...
##  $ EducationField                : Factor w/ 6 levels "Human Resources",..: 2 2 4 4 2 3 2 2 5 3 ...
##  $ Gender                        : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 1 1 2 1 2 ...
##  $ JobRole                       : Factor w/ 9 levels "Healthcare Representative",..: 8 3 8 1 1 4 5 3 3 9 ...
##  $ MaritalStatus                 : Factor w/ 3 levels "Divorced","Married",..: 2 2 3 1 3 2 2 3 2 1 ...
##  $ OverTime                      : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 1 1 ...
##  $ Age_Group                     : Factor w/ 3 levels "young","mid",..: 1 3 3 3 2 3 3 1 2 1 ...
##   ..- attr(*, "discretized:breaks")= num  18 32 40 60
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ DRate_Group                   : Factor w/ 3 levels "low","medium",..: 3 3 1 1 2 3 2 1 3 3 ...
##   ..- attr(*, "discretized:breaks")= num  102 573 1040 1499
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ DistanceFromHome_Group        : Factor w/ 3 levels "short","medium",..: 3 3 3 2 3 3 1 3 2 1 ...
##   ..- attr(*, "discretized:breaks")= num  1 4 10 29
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ Education_Group               : Factor w/ 5 levels "1","2","3","4",..: 1 4 2 1 3 4 2 1 2 1 ...
##  $ EmployeeNumber_Group          : Factor w/ 3 levels "low","medium",..: 3 2 2 2 3 2 2 3 3 1 ...
##   ..- attr(*, "discretized:breaks")= num  1 684 1390 2068
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ EnvironmentSat_Group          : Factor w/ 2 levels "unsatisfactory",..: 2 2 2 2 2 1 2 1 2 2 ...
##   ..- attr(*, "discretized:breaks")= num  1 3 4
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ HourlyRate_Group              : Factor w/ 3 levels "low","medium",..: 3 3 2 1 3 2 1 3 1 3 ...
##   ..- attr(*, "discretized:breaks")= num  30 54 78 100
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ JobInvolvement_Group          : Factor w/ 4 levels "1","2","3","4": 3 3 2 2 2 3 2 3 2 3 ...
##  $ JobLevel_Group                : Factor w/ 5 levels "1","2","3","4",..: 2 2 2 4 3 4 2 2 1 1 ...
##  $ JobSatisfaction_Group         : Factor w/ 2 levels "unsatisfactory",..: 2 2 2 1 1 2 2 2 2 1 ...
##   ..- attr(*, "discretized:breaks")= num  1 3 4
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ MonthlyIncome_Group           : Factor w/ 3 levels "low","medium",..: 2 1 2 3 3 3 2 1 2 1 ...
##   ..- attr(*, "discretized:breaks")= num  1009 3692 6524 19973
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ MonthlyRate_Group             : Factor w/ 3 levels "low","medium",..: 1 3 1 3 3 1 1 3 2 2 ...
##   ..- attr(*, "discretized:breaks")= num  2094 10227 18779 26999
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ NumCompaniesWorked_Group      : Factor w/ 3 levels "small","medium",..: 3 3 3 2 2 2 3 1 2 1 ...
##   ..- attr(*, "discretized:breaks")= num  0 1 3 9
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ PercentSalaryHike_Group       : Factor w/ 4 levels "[11,12)","[12,14)",..: 3 4 3 3 4 3 2 3 1 1 ...
##   ..- attr(*, "discretized:breaks")= num  11 12 14 18 25
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ PerformanceRating_Group       : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 1 1 1 1 ...
##  $ RelationshipSatisfaction_Group: Factor w/ 4 levels "1","2","3","4": 3 2 1 4 4 3 4 2 2 2 ...
##  $ StockOptionLevel_Group        : Factor w/ 4 levels "0","1","2","3": 3 2 1 2 1 2 2 1 2 2 ...
##  $ TotalWorkingYears_Group       : Factor w/ 4 levels "[0,6)","[6,10)",..: 1 2 3 3 3 3 2 1 1 1 ...
##   ..- attr(*, "discretized:breaks")= num  0 6 10 13 28
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ TrainingTimesLastYear_Group   : Factor w/ 3 levels "[0,2)","[2,3)",..: 2 3 3 3 3 2 3 2 3 3 ...
##   ..- attr(*, "discretized:breaks")= num  0 2 3 6
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ WorkLifeBalance_Group         : Factor w/ 4 levels "1","2","3","4": 2 3 3 3 1 2 3 3 3 3 ...
##  $ YearsAtCompany_Group          : Factor w/ 4 levels "[0,3)","[3,5)",..: 1 3 3 4 4 4 2 2 3 2 ...
##   ..- attr(*, "discretized:breaks")= num  0 3 5 9 40
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ YearsInCurrentRole_Group      : Factor w/ 4 levels "[0,2)","[2,3)",..: 1 2 3 4 4 3 3 3 3 3 ...
##   ..- attr(*, "discretized:breaks")= num  0 2 3 7 18
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ YearsSinceLastPromotion_Group : Factor w/ 2 levels "[0,1)","[1,15]": 2 2 1 2 2 2 1 2 1 1 ...
##   ..- attr(*, "discretized:breaks")= num  0 1 15
##   ..- attr(*, "discretized:method")= chr "frequency"
##  $ YearsWithCurrManager_Group    : Factor w/ 4 levels "[0,2)","[2,3)",..: 2 3 3 1 3 3 2 2 3 2 ...
##   ..- attr(*, "discretized:breaks")= num  0 2 3 7 14
##   ..- attr(*, "discretized:method")= chr "frequency"
write.csv(employee_factors,'employee_factors.csv')

Data Visualization

Conduct exploratory data analysis (EDA): derive descriptive statistics and apply data visualization to check for interesting data patterns.

Descriptive Statistics Analysis

Based on Attrition (Yes and No), calculate the mean value of each numeric columns and compare the results of different groups.

Yes <- as.data.frame(employee_attrition %>% 
  group_by(Attrition) %>% 
  filter(Attrition=='Yes') %>%
  summarise(m_Age = mean(Age),
            m_Drate = mean(DailyRate), 
            m_Distance = mean(DistanceFromHome), 
            m_Edu = mean(Education),
            m_Envir_Sat = mean(EnvironmentSatisfaction),
            m_JobEnvolve = mean(JobInvolvement),
            m_JobLevel = mean(JobLevel),
            m_JobSatis = mean(JobSatisfaction),
            m_PerfRate = mean(PerformanceRating),
            m_StockOption = mean(StockOptionLevel),
            m_RelationSat = mean(RelationshipSatisfaction),
            m_NumberCompanyWork = mean(NumCompaniesWorked),
            m_TotalWorkYear = mean(TotalWorkingYears),
            m_Worklife_Balance = mean(WorkLifeBalance),
            m_TrainTime_LastYear = mean(TrainingTimesLastYear),
            m_Work_Company = mean(YearsAtCompany),
            m_CurrentPos_Year = mean(YearsInCurrentRole),
            m_Promotion_Year = mean(YearsSinceLastPromotion),
            m_CurrentManager_Year = mean(YearsWithCurrManager)))
##      m_Age  m_Drate m_Distance       m_Edu m_Envir_Sat m_JobEnvolve
## 1 3.925396 73.25117  -2.270155 0.002951384   0.3292925    0.2285738
##   m_JobLevel m_JobSatis  m_PerfRate m_StockOption m_RelationSat
## 1    0.47467  0.4025579 -0.02503758     0.2888913    0.07395403
##   m_NumberCompanyWork m_TotalWorkYear m_Worklife_Balance
## 1          -0.3505097        3.214653          0.1306151
##   m_TrainTime_LastYear m_Work_Company m_CurrentPos_Year m_Promotion_Year
## 1            0.2449485       2.080196          1.538961        0.4323231
##   m_CurrentManager_Year
## 1                1.3953


After comparing the results, one should eliminate the columns which have little difference.

employee_attrition %>% 
  group_by(Attrition) %>% 
  summarise(m_Age = mean(Age), 
            m_Drate = mean(DailyRate), 
            m_Distance = mean(DistanceFromHome), 
            m_Joblevel = mean(JobLevel),
            m_Total_work_year = mean(TotalWorkingYears),
            m_Year_at_Company = mean(YearsAtCompany),
            m_Year_current_position = mean(YearsInCurrentRole),
            m_Year_promotion = mean(YearsSinceLastPromotion),
            m_Year_current_manager = mean(YearsWithCurrManager))
##      m_Age  m_Drate m_Distance m_Joblevel m_Total_work_year
## 1 36.92845 800.3893   9.309199   2.064736          10.29727
##   m_Year_at_Company m_Year_current_position m_Year_promotion
## 1          6.903748                4.139693         2.120954
##   m_Year_current_manager
## 1               3.910562

Correlation Plot

cor_matrix <- cor(employee_attrition[complete.cases(employee_attrition), sapply(employee_attrition, is.numeric)], method = "pearson")
corrplot(cor_matrix,type = 'upper', tl.col = "black")

The Histograms of Numeric Variables

employee_attrition %>% keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  facet_wrap(~key, scales = "free") +
  geom_histogram(bins = 15, fill = "lightblue", color = "white") +
  labs(title = "The Histogram of Numeric Columns")

Specific Graphs of Each Numeric Variable

Barplt of Age Variable
Summary: The mean of number of employees who tend to leave is lower than that of employees who tend not to leave. Besides, the distribution of “not to leave” group follows standard deviation model while that of “to leave” group is right-skewed, implying young employees show more willingness to leave the company.

cdat <- ddply(employee_attrition, "Attrition", summarise, Age.mean = mean(Age))
p<-ggplot(employee_attrition, aes(x = Age, fill = Attrition)) +
  geom_histogram(binwidth = 3, alpha = .5, position = "identity")+
  geom_vline(data = cdat, aes(xintercept = Age.mean),
             linetype = "dashed", size = 1, color = c("pink", "lightblue"))
p<-ggplotly(p)
p

Barplot of DailyRate Variable
Summary: From thsi plot, one can only summarize that the mean of number of employees who tend to leave is lower than that of employees who tend not to leave. The distributions are randomly, implying that DailyRate might not have great influence to turnover rate.

cdat <- ddply(employee_attrition, "Attrition", summarise, DailyRate.mean = mean(DailyRate))
p<-ggplot(employee_attrition, aes(x = DailyRate, fill = Attrition)) +
  geom_histogram(binwidth = 100, alpha = .5, position = "identity")+
  geom_vline(data = cdat, aes(xintercept = DailyRate.mean),
             linetype = "dashed", size = 1, color = c("pink", "lightblue"))
p<-ggplotly(p)
p

Barplot of DistanceFromHowe Variable
Generally, the distance from home has little influence to the turnover rate, except for those who live far enough.

cdat <- as.data.frame(ddply(employee_attrition, c("DistanceFromHome_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = DistanceFromHome_Group, y = Attrition.Number, fill = Attrition)) +
  geom_bar(alpha = .5, stat = "identity", position = position_dodge())
p<-ggplotly(p)
p

Barplot of JobLevel Variable
The turnover rate tends to declines significantly for employees in job with higher level.

cdat <- as.data.frame(ddply(employee_attrition, c("JobLevel_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = JobLevel_Group, y = Attrition.Number, fill = Attrition)) +
  geom_bar(alpha = .5, stat = "identity", position = position_dodge())
p<-ggplotly(p)
p

Barplot of Total Work Year Variable
Summary: The turnover rate tends to decrease for employees who have more working experience (years).

cdat <- as.data.frame(ddply(employee_attrition, c("TotalWorkingYears_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = TotalWorkingYears_Group, y = Attrition.Number, fill = Attrition)) +
  geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
  geom_hline(yintercept = mean(employee_attrition$TotalWorkingYears[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
  geom_hline(yintercept = mean(employee_attrition$TotalWorkingYears[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p

Barplot of Year at Company Variable
Summary: Generally, the turnover rate tends to decrease for employees who spend more time in this company.

cdat <- as.data.frame(ddply(employee_attrition, c("YearsAtCompany_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = YearsAtCompany_Group, y = Attrition.Number, fill = Attrition)) +
  geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
  geom_hline(yintercept = mean(employee_attrition$YearsAtCompany[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
  geom_hline(yintercept = mean(employee_attrition$YearsAtCompany[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p

Barplt of Year in Current Position Variable
Summary: The turnover rate tends to decrease for employees who spend more time in current position. However, for group range from 7 to 18, the turnover rate increases sightly.

cdat <- as.data.frame(ddply(employee_attrition, c("YearsInCurrentRole_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = YearsInCurrentRole_Group, y = Attrition.Number, fill = Attrition)) +
  geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
  geom_hline(yintercept = mean(employee_attrition$YearsInCurrentRole[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
  geom_hline(yintercept = mean(employee_attrition$YearsInCurrentRole[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p

Barplt of Year since Last Promotion
Summary: YearSinceLastPromotion has little influence to the turnover rate.

cdat <- as.data.frame(ddply(employee_attrition, c("YearsSinceLastPromotion_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = YearsSinceLastPromotion_Group, y = Attrition.Number, fill = Attrition)) +
  geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
  geom_hline(yintercept = mean(employee_attrition$YearsSinceLastPromotion[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
  geom_hline(yintercept = mean(employee_attrition$YearsSinceLastPromotion[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p

Barplt of Year with Current Manager
Summary: The turnover rate tends to decrease for employees who spend more time with current manager.

cdat <- as.data.frame(ddply(employee_attrition, c("YearsWithCurrManager_Group", "Attrition"), summarise, Attrition.Number = length(Attrition)))
p<-ggplot(cdat, aes(x = YearsWithCurrManager_Group, y = Attrition.Number, fill = Attrition)) +
  geom_bar(alpha = .5, stat = "identity", position = position_dodge()) +
  geom_hline(yintercept = mean(employee_attrition$YearsWithCurrManager[employee_attrition$Attrition == 'Yes']), col = "lightblue") +
  geom_hline(yintercept = mean(employee_attrition$YearsWithCurrManager[employee_attrition$Attrition == 'No']), col = "pink")
p<-ggplotly(p)
p


Association Rules Model

Baseline Model

Utilizing asscoiation rules to create a baseline model (setting all the hyper-paramter to default value)

factor_trans <- as(employee_factors[, sapply(employee_factors, is.factor)], "transactions")
baseline <- apriori(factor_trans)

Result of Baseline Model

inspect(head(sort(baseline, by = "lift", decreasing = T), 5))
##     lhs                                      rhs                                support confidence     lift count
## [1] {YearsAtCompany_Group=[0,3),                                                                                 
##      YearsSinceLastPromotion_Group=[0,1),                                                                        
##      YearsWithCurrManager_Group=[0,2)}    => {YearsInCurrentRole_Group=[0,2)} 0.1328790  0.9873418 4.932507   156
## [2] {PerformanceRating_Group=3,                                                                                  
##      YearsAtCompany_Group=[0,3),                                                                                 
##      YearsSinceLastPromotion_Group=[0,1),                                                                        
##      YearsWithCurrManager_Group=[0,2)}    => {YearsInCurrentRole_Group=[0,2)} 0.1090290  0.9846154 4.918887   128
## [3] {YearsAtCompany_Group=[0,3),                                                                                 
##      YearsSinceLastPromotion_Group=[0,1)} => {YearsInCurrentRole_Group=[0,2)} 0.1345826  0.9294118 4.643104   158
## [4] {PerformanceRating_Group=3,                                                                                  
##      YearsAtCompany_Group=[0,3),                                                                                 
##      YearsSinceLastPromotion_Group=[0,1)} => {YearsInCurrentRole_Group=[0,2)} 0.1107325  0.9285714 4.638906   130
## [5] {BusinessTravel=Travel_Rarely,                                                                               
##      YearsAtCompany_Group=[0,3),                                                                                 
##      YearsWithCurrManager_Group=[0,2)}    => {YearsInCurrentRole_Group=[0,2)} 0.1013629  0.9083969 4.538119   119

Visualize the Baseline Model: Top 10 Item Frequency Plot

itemFrequencyPlot(factor_trans, topN = 10, type = "absolute", main = "Item Frequency")

plot(baseline, jitter = 0)


Summary and Analysis:
One can conclude that with decrease in support, both the values of confidence and lift will increase.

For group with Attribution = Yes

Setting the support to 0.01 implicates that the combination of rules should at least appears twise in the transaction set.
Setting the confidence to 0.8 implicates that the relationship between lhs and rhs are strong enough to support their correlation.
Besides, using as.subset() function can help the programmer to prune redundant transaction rules.

rules_yes <- apriori(factor_trans, 
                 parameter = list(support = 0.01, confidence = 0.8, minlen = 3),
                 appearance = list(default = "lhs", rhs = c("Attrition=Yes")), 
                 control = list(verbose = F))
subset_rules <- which(colSums(is.subset(rules_yes, rules_yes)) > 1)
rules_yes <- rules_yes[-subset_rules]
result_yes <- head(sort(rules_yes, by = "lift", decreasing = T), 10)

Checking the Top 5 rules

inspect(head(sort(rules_yes, by = "lift", decreasing = T), 5))
##     lhs                                       rhs                support confidence     lift count
## [1] {BusinessTravel=Travel_Frequently,                                                            
##      MaritalStatus=Single,                                                                        
##      Age_Group=young,                                                                             
##      YearsAtCompany_Group=[0,3)}           => {Attrition=Yes} 0.01192504          1 6.345946    14
## [2] {BusinessTravel=Travel_Frequently,                                                            
##      Age_Group=young,                                                                             
##      StockOptionLevel_Group=0,                                                                    
##      YearsAtCompany_Group=[0,3)}           => {Attrition=Yes} 0.01277683          1 6.345946    15
## [3] {MaritalStatus=Single,                                                                        
##      Age_Group=young,                                                                             
##      MonthlyRate_Group=medium,                                                                    
##      YearsInCurrentRole_Group=[0,2)}       => {Attrition=Yes} 0.01022147          1 6.345946    12
## [4] {MaritalStatus=Single,                                                                        
##      Age_Group=young,                                                                             
##      MonthlyRate_Group=medium,                                                                    
##      TotalWorkingYears_Group=[0,6),                                                               
##      YearsSinceLastPromotion_Group=[0,1)}  => {Attrition=Yes} 0.01022147          1 6.345946    12
## [5] {OverTime=Yes,                                                                                
##      Age_Group=young,                                                                             
##      JobSatisfaction_Group=unsatisfactory,                                                        
##      MonthlyIncome_Group=low,                                                                     
##      YearsSinceLastPromotion_Group=[0,1)}  => {Attrition=Yes} 0.01022147          1 6.345946    12

Visualize those rules

plot(result_yes, method = "paracoord", control = list(reorder = T))

plot(rules_yes, measure = c("support", "lift"), shading = "confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

plot(result_yes, method = "graph")

For group with Attribution = No

Setting the support to 0.1 implicates that the combination of rules should at least appears 50 times in the transaction set.
Setting the confidence to 0.9 implicates that the relationship between lhs and rhs are strong enough to support their correlation.
Besides, using as.subset() function can help the programmer to prune redundant transaction rules.

rules_no <- apriori(factor_trans, 
                 parameter = list(support = 0.1, confidence = 0.90, minlen = 4),
                 appearance = list(default = "lhs", rhs = c("Attrition=No")), 
                 control = list(verbose = F))
subset_rules <- which(colSums(is.subset(rules_no, rules_no)) > 1)
rules_no <- rules_no[-subset_rules]
result_no <- head(sort(rules_no, by = "lift", decreasing = T), 15)

The Top 5 rules

inspect(head(sort(rules_no, by = "lift", decreasing = T), 5))
##     lhs                                       rhs              support confidence     lift count
## [1] {OverTime=No,                                                                               
##      JobSatisfaction_Group=satisfactory,                                                        
##      YearsAtCompany_Group=[5,9)}           => {Attrition=No} 0.1286201  0.9805195 1.163933   151
## [2] {OverTime=No,                                                                               
##      JobLevel_Group=2,                                                                          
##      JobSatisfaction_Group=satisfactory}   => {Attrition=No} 0.1567291  0.9684211 1.149572   184
## [3] {Department=Research & Development,                                                         
##      MonthlyIncome_Group=high,                                                                  
##      TrainingTimesLastYear_Group=[3,6]}    => {Attrition=No} 0.1022147  0.9677419 1.148765   120
## [4] {OverTime=No,                                                                               
##      YearsAtCompany_Group=[5,9),                                                                
##      YearsSinceLastPromotion_Group=[1,15]} => {Attrition=No} 0.1149915  0.9642857 1.144663   135
## [5] {OverTime=No,                                                                               
##      JobSatisfaction_Group=satisfactory,                                                        
##      MonthlyIncome_Group=medium}           => {Attrition=No} 0.1371380  0.9640719 1.144409   161

Visualize the rules

plot(result_no, method = "paracoord", control = list(reorder = T))

plot(rules_no, measure = c("support", "lift"), shading = "confidence")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Visualize the rules

plot(result_no, method = "graph")

Analysis Report

For Association Rules Model:
One should run the models several times with different hyper-parameters in order to optimize the results. Even though it seems that the higher the values of support and confidence, the better the result will be, the selection will alter with the purposes of programmer. Besides, one should avoid to set a extrem value (such as setting support to 1.0), which might cause overfitting problem.

According to the analysis illustrated above, there are several factors contributes to “Attribution=Yes” attribute: * BusinessTravel = Travel_Frequently
* Age_Group = Young
* JobLevel_Group = 1
* OverTime = Yes
* JobSatisfaction_Goup = Unsatisfactory
* YearAtCompany_Group = [0,3)
* StockOptionLevel_Group = 0

Hence, one can summarise that those new-enrolled, young employees who travel frequently and work overtime frquently tend to level the company, especially in cases that they feel unsatisfactory about their works and have limited option for stock.

For “Attribution=No” group, the significant factors are: * OverTime = No
* JobSatisfaction_Goup = Satisfactory
* YearAtCompany_Group = [5,9)
* StockOptionLevel_Group = 1

Hence, employees who do not work overtime and satisfy their job tend not to level the company, especially for those who have option for stock and work in this company above 5 years.


Recommendation
In this situation, the manager can investigate features that bring negative influence to employees and then launch some activities to reduce the turnover rate. For example, manager can take inititive to reducev the overtime rate, especially for young employees.


Finally
This is the end of this presentation. If you want to find more interesting data paterns. You can click the link below to talor your own Association Rule Model. Hope you enjoy it!
Shiny App web link
Reminder: Please be patient about the fleshing speed of this App.